ABSTRACT Comparative modeling methods are described
that can be used to construct a three-dimensional
model structure of a new protein from knowledge of
its sequence and of the experimental structures and
sequences of other members of its homology family.
The methods are illustrated with the mammalian
serine protease family, for which seven experimental
structures have been reported in the literature, and
the sequences for over 35 different protein members
of the family are available. The strategy for
modeling these proteins is presented, and criteria
are developed for determining and assigning the
reliability of the modeled structure. Criteria are
described that are specially designed to help detect
cases in which it is likely that the local structure
diverges significantly from the usual conformation
of the family.

Key words: protein structure, computer model- 
ing, structure prediction, sequence 
homology, structure homology 

INTRODUCTION 

It has been apparent for close to two decades that 
proteins from very different sources and sometimes 
with rather diverse functions can have homologous 
sequences and consequently very similar three- 
dimensional structures. 1 This fact has been the ba- 
sis for the development of comparative modeling 
methods 2-7 which permit extrapolation from the
experimentally determined structure for one or more
members of a homologous family to a new member of
this family whose sequence has been determined but
whose structure is as yet unknown. A number of
factors combine to increase greatly the application
of comparative modeling techniques today. The
large number of protein structures 8 and the
exploding number of protein sequences 9 that are
being reported in the literature and that are readily
available in computerized databases provide the basic
structural and sequence data needed to apply the
method. The proteins in which we are interested are
often available in only small quantities, too little
for structural studies unless the gene or mRNA is
cloned or synthesized and expressed. Even when this
latter effort is considered worthwhile and is
initiated, the comparative modeling studies can be
performed
in the meantime, providing an approximate view of 
the structure of the molecule until sufficient protein 
can be obtained and the experimental structure can 
be determined. Such a model structure can be very 
useful in the interim to help plan and interpret 
biochemical 10 and mutagenesis 11 experiments, to 
probe functional properties, 12 and to improve our
understanding of ligand or substrate binding. 13-15

The use of comparative methods involves extrap- 
olation from one or more known structures to con- 
struct a new structure. It is important to consider 
how accurately this function can be performed in 
assessing whether it is a useful and worthwhile ex- 
ercise. Certainly, comparisons that have been pub- 
lished between predicted model structures and the 
experimental structures determined later 16-18 indi- 
cate that the modeled structure is not likely to be 
completely accurate. This paper reports the methods 
that we have developed to perform comparative 
modeling and the criteria that we use to estimate
the reliability of various parts of the structure and 
especially to identify potential problem areas that 
are likely to be particularly difficult to predict cor- 
rectly. 

To describe properly and illustrate the modeling 
techniques, we will use the mammalian serine pro- 
tease family. 19 Members of this family are ubiqui- 
tous in nature; they are present all along the evolu- 
tionary pathway from bacteria to humans. They 
play important roles in a wide variety of body func- 
tions, including blood coagulation, fibrinolysis, com- 
plement activation, fertilization, and digestion. 
Therefore, modeling the various members of this 
family can provide valuable new information about 
diverse critical biological functions of the body.

Comparative modeling works best when there are
several experimental structures and known sequences.
Experimental structures for seven different serine
proteases* that can be considered members of this
mammalian family have been reported in the literature
and deposited in the Brookhaven Protein Structure
Database. 8 These include chymotrypsin, 24 trypsin, 25
elastase, 26 kallikrein, 27 rat mast cell protease, 28
Streptomyces griseus trypsin-like protein, 23 and
tonin 29 (see Table I). Several of these have been
solved independently in more than one laboratory
either in the same form or in alternate forms of the
molecule. 30-32 In addition, structures have been
reported for the precursor, zymogen, form of
chymotrypsin 33,34 and trypsin. 35,36 To complement
these structures, amino acid sequences for over 35
different serine proteases have been reported
(Table II), not counting species variations. Thus
this family presents an ideal system for developing
comparative modeling methods and at the same time
applying them to proteins involved in important and
interesting biological functions.

TABLE I. Experimentally Known Three-Dimensional
Structures of Serine Proteases

Protein              Source       Resolution (Å)   Reference
Chymotrypsin         Bovine       1.8              24*
Trypsin              Bovine       1.8              25*
Elastase             Porcine      1.8              26
Kallikrein           Porcine      1.8              27
Mast cell protease   Rat          1.8              28
S. griseus trypsin   S. griseus   1.7              23
Tonin                Rat          1.8              29

*Structures have been determined by several groups for each of
these proteins. The reference shown corresponds to the
coordinates used in this work.

COMPARATIVE MODELING METHOD: THE 
"SPARE PARTS" ALGORITHM 

Comparative modeling requires extrapolation 
from known structures to produce a model of an un- 
known structure. Our ability to extrapolate accu- 
rately is, unfortunately, greatly limited by our still 
rudimentary knowledge and understanding of protein
structure and energetics. Consequently, the
techniques that are used to extrapolate influence 
the resultant model critically and may introduce 
considerable error. The methods that we have devel- 
oped attempt to systematize the modeling process by 
combining all the known structures and sequences 
to help improve the accuracy of the extrapolation. 



*Structures have also been reported for several serine
proteases from bacterial sources. 20-22 Although these are cer-

Analysis of the Structural Properties
of the Family

The first steps in the modeling process are illus- 
trated schematically in Figure 1. Let us assume that 
we have experimentally determined three-dimensional
structures for several members of the homologous
family of interest, e.g., structures "A," "B,"
and "C" in Figure 1a. The known structures are
superimposed in three dimensions to obtain a maximal
overlap of the structures (Fig. 1b). Performing this
superposition of the structures is sometimes not a 
trivial procedure. This is true because we want to 
overlap the molecules based on the parts that are 
the same and ignore the parts that are different. The 
more different the molecules are from each other, 
the harder it is to find a unique best overlap of the 
common features of the structures. Several programs
have been written that do this analytically. 37,38

Once superimposed, there are parts of the known 
structures that overlap very well, indicating that 
the structures are closely conserved in these regions 
(see bold areas in Fig. 1b). We call these portions
"structurally conserved regions," or SCRs, and ex- 
pect that they will usually remain conserved in all 
members of this homologous family. These regions 
are usually composed of the secondary structure el- 
ements, the immediate active site, and other essen- 
tial structural framework residues of the molecule. 
Between these conserved elements are highly vari- 
able stretches that differ significantly from one 
member of the family to the next. These are called 
"variable regions," or VRs. They are almost always 
loops that lie on the external surface of the protein, 
and they contain all the additions and deletions be- 
tween different protein sequences. 
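The parsing into SCRs and VRs can be illustrated with a minimal sketch. It assumes that the superposed structures have already been arranged so that each column holds one α-carbon position per structure (or None where a structure has no residue). The 1.0 Å cutoff, the minimum SCR length of three residues, and the function names are illustrative assumptions, not values taken from this work.

```python
import math
from itertools import combinations

def max_pairwise_deviation(points):
    """Largest distance between any two of the given 3-D points (needs >= 2)."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))

def parse_regions(columns, cutoff=1.0, min_len=3):
    """Split aligned columns into ('SCR' | 'VR', first_index, last_index) runs.

    columns[i] holds one alpha-carbon position per superposed structure,
    or None where a structure has no residue at that position.
    """
    conserved = [
        all(p is not None for p in col) and max_pairwise_deviation(col) <= cutoff
        for col in columns
    ]
    # Collect runs of consecutive columns that share the same state.
    runs, start = [], 0
    for i in range(1, len(conserved) + 1):
        if i == len(conserved) or conserved[i] != conserved[start]:
            runs.append(["SCR" if conserved[start] else "VR", start, i - 1])
            start = i
    # A conserved run shorter than min_len is treated as part of a VR.
    for run in runs:
        if run[0] == "SCR" and run[2] - run[1] + 1 < min_len:
            run[0] = "VR"
    # Merge neighbouring runs that now carry the same label.
    merged = []
    for label, first, last in runs:
        if merged and merged[-1][0] == label:
            merged[-1][2] = last
        else:
            merged.append([label, first, last])
    return [tuple(run) for run in merged]
```

In practice the cutoff would be chosen by inspecting the superposition, and the SCR boundaries would be adjusted by eye as more structures become available.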

After the structures have been overlapped in 
three dimensions and parsed into SCRs and VRs, the 
next step is to align their amino acid sequences. For 
purposes of comparative modeling, the sequence 
alignment is done differently from the usual 
methods. 39 Instead of relying on criteria such as 
amino acid identity or homology, we use strictly the 
three-dimensional overlap as the criterion. When 
the α-carbons of the respective residues in the
overlapped protein structures occupy the same place
in three-dimensional space, then the residues are
placed in correspondence in the sequence alignment
(Fig. 2). Thus
the alignment is primarily concerned with the 
SCRs, since these are the portions of the structures 
that are basically the same in all protein members of 
the family. The alignment in the VRs is often arbi- 
trary, unless two or more structures have similar 
conformations in a particular VR (for an example of 
this see the VR in the upper left of the structures 
"A" (sequence C-D-F-A) and "C" (sequence C-R-Y-V)
in Figs. 1 and 2). 
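A structure-based alignment of this kind can be sketched for two superposed chains as follows. The greedy nearest-neighbor pairing, the 1.5 Å cutoff, and the function names are simplifying assumptions made for illustration; residues that find no partner within the cutoff fall into the VRs and are written against a dash.

```python
import math

def structural_alignment(ca_a, ca_b, cutoff=1.5):
    """Pair residues of two superposed chains whose alpha-carbons coincide.

    Returns (index_in_a, index_in_b) pairs in increasing sequence order.
    """
    pairs, next_j = [], 0
    for i, p in enumerate(ca_a):
        best_j, best_d = None, cutoff
        for j in range(next_j, len(ca_b)):
            d = math.dist(p, ca_b[j])
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            pairs.append((i, best_j))
            next_j = best_j + 1          # keep the pairing sequential
    return pairs

def gapped_rows(seq_a, seq_b, pairs):
    """Write both sequences as gapped rows; unpaired residues face a dash."""
    row_a, row_b, i, j = [], [], 0, 0
    # The final sentinel pair flushes any unpaired residues at the chain ends.
    for pi, pj in pairs + [(len(seq_a), len(seq_b))]:
        while i < pi:
            row_a.append(seq_a[i]); row_b.append("-"); i += 1
        while j < pj:
            row_a.append("-"); row_b.append(seq_b[j]); j += 1
        if pi < len(seq_a):
            row_a.append(seq_a[pi]); row_b.append(seq_b[pj])
            i, j = pi + 1, pj + 1
    return "".join(row_a), "".join(row_b)
```

Because the pairing is greedy, a chain that passes close to a sequentially distant residue could be mispaired; a careful implementation would restrict the search window or use dynamic programming on the distance matrix.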

The resultant sequence alignment is then scrutinized
carefully to identify strong stretches or patterns
of sequence homology that are characteristic of each
SCR in the structure. For example, the sequence
"L-S/T-V-π-I-π," where π is a charged or polar
residue, is conserved in the second SCR in Figure 2.
Similarly, the sequence G-I-A is found in all members
of the third SCR. However, in the fourth SCR, the
situation is less clear. A proline usually appears in
the second position but may be replaced by a glycine.
Most SCRs can be identified by a characteristic (not
necessarily contiguous) sequence pattern. In some
cases, there is a pattern of hydrophobic or
hydrophilic residues rather than specific side
chains. Careful analysis of the sequences and the
corresponding three-dimensional structures will
usually allow some pattern to be discerned.

TABLE II. Sequences of the Serine Proteases*

Protein                           Code   Source
Chymotrypsin                      CHT†   Bovine
Trypsin                           TRP†   Bovine
Elastase                          ELA†   Porcine
Kallikrein                        KAL†   Porcine
Mast cell protease                MCP†   Rat
S. griseus trypsin                SGT†   S. griseus
Tonin                             TON†   Rat
Haptoglobin heavy chain           HPH    Human
Protein Z                         PRZ    Bovine
Protein C                         PRC    Human
Nerve growth factor α chain       NGA    Human
Nerve growth factor γ chain       NGG    Human
Blood clotting factor VII         VII    Human
Blood clotting factor IX          FIX    Human
Blood clotting factor X           FAX    Human
Blood clotting factor XI          FXI    Human
Blood clotting factor XII         XII    Human
Plasmin                           PLM    Human
Apolipoprotein A                  ALP    Human
Tissue plasminogen activator      TPA    Human
Urokinase                         UKH    Human
Thrombin                          THR    Human
Complement factor B               CFB    Human
Complement factor 2               CF2    Human
Complement factor D               CFD    Human
Complement factor 1R              C1R    Human
Complement factor 1S              C1S    Human
Adipocyte protease                ASP    Mouse
Cathepsin G                       CAG    Human
T-cell serine protease            TCL    Mouse
Hannuka factor                    HAF    Human
Cytotoxic T lymphocyte protease   CTL    Mouse
Batroxobin                        BTX    Snake
EGF binding protein               EBP    Mouse
Drosophila snake locus            DSN    Drosophila

*The sequences are taken from the sequence database.
†Proteins for which a three-dimensional structure is available;
these sequences are taken from the Brookhaven structure database.

The identification of these sequence homology 
patterns is a crucial step in the modeling. As is 
shown below, it is essential for the correct alignment 
of a new sequence. If the characteristic sequence 
pattern is not present, then it is not clear how to 
align the "new" sequence and, consequently, how to
construct the model structure. This is one of the
reasons that the comparative modeling method
currently depends so heavily on the retention of
sequence homology. The above steps are performed
once for a protein family and need be repeated
only as new experimental structures and their
sequences become available.
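Once the characteristic patterns have been written down, locating them in a sequence is a simple matching problem. The sketch below translates a dash-separated pattern into a regular expression and reports every position at which it fits. The lower-case class symbols ("p" standing in for π, "x" for a nonpolar residue) and the function names are assumptions made for this illustration; the residue classes follow the legend of Figure 4.

```python
import re

# Residue classes used in the patterns; "p" stands in for the symbol pi.
CLASSES = {
    "p": "STQNDEKRH",   # polar or charged residue
    "x": "AVLIM",       # nonpolar residue
}

def pattern_to_regex(pattern):
    """Translate a pattern such as 'L-S/T-V-p-I-p' into a compiled regex."""
    parts = []
    for token in pattern.split("-"):
        if token == ".":
            parts.append(".")                       # any residue
        elif token in CLASSES:
            parts.append("[" + CLASSES[token] + "]")
        else:                                       # one residue, or alternatives 'S/T'
            parts.append("[" + token.replace("/", "") + "]")
    return re.compile("".join(parts))

def find_scr_candidates(sequence, pattern):
    """Return every 0-based start index at which the pattern fits."""
    body = pattern_to_regex(pattern).pattern
    # A lookahead reports overlapping candidate alignments as well.
    return [m.start() for m in re.finditer("(?=" + body + ")", sequence)]
```

A strong pattern yields a single candidate position, whereas a weak pattern yields several, each of which would have to be carried forward as a possible alignment.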

Construction of a "New" Structure 

We are now ready to begin modeling a "new" protein
sequence of interest that is a clear member of
this homologous family. The first step is to align the
new sequence to the sequences of the SCRs using
the previously identified characteristic sequence
homology patterns. It should be possible to align an
[...]
L-T-V-K-I-E fits the pattern L-S/T-V-π-I-π described
above and can be aligned to the SCR at positions
7-12 in Figure 2 and similarly for the G-I-A
sequence of the SCR at 16-18. There will occasionally
be more than one possible alignment for a particular
SCR, especially if the sequence pattern is a
weak one. Such an example might occur in the SCR
corresponding to residues 1-2 (Fig. 2). In such cases,
the different possible alignments would have to be
considered. As each portion of the "new" sequence is
aligned to an SCR, the rest of the positions in that
SCR are filled with the adjacent "new" sequence
without permitting any additions or deletions
within the SCR.†

The remaining residues make up the VRs. The
"new" sequence in each VR is examined to see if it
corresponds in length and residue character to one of 
the VRs of the known structures. For example, the 
third VR in the "new" protein in Figures 1 and 2, 
residues 13-15, has the same sequence length and 
character as this VR in protein "B" (A-W-D-S-L in
"new" vs. A-W-N-T-M in "B") and therefore has been
aligned to it. On the other hand, none of the known 
structures has a sequence that corresponds to resi- 
dues 3-6 of the "new" protein in residue length. 
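The comparison of a "new" VR with the VRs of the known structures can be sketched as a two-part test: equal length first, then similarity of residue character. The three coarse residue classes, the 0.8 acceptance threshold, and the function names below are illustrative assumptions rather than criteria stated in this work.

```python
def residue_class(aa):
    """Coarse residue character, loosely following the legend of Figure 4."""
    for name, members in (("nonpolar", "AVLIM"),
                          ("aromatic", "FYW"),
                          ("polar", "STQNDEKRH")):
        if aa in members:
            return name
    return aa   # G, P and C are each treated as a class of their own

def select_vr_template(new_vr, known_vrs, min_score=0.8):
    """Pick the known VR with the same length and most similar character.

    known_vrs maps a structure name to that structure's sequence in this VR.
    Returns (name, score), or None when no known VR is suitable.
    """
    best = None
    for name, seq in known_vrs.items():
        if not seq or len(seq) != len(new_vr):
            continue                     # a length difference rules the loop out
        matches = sum(residue_class(a) == residue_class(b)
                      for a, b in zip(new_vr, seq))
        score = matches / len(seq)
        if score >= min_score and (best is None or score > best[1]):
            best = (name, score)
    return best
```

A result of None corresponds to the case in which no known structure offers a usable loop and a conformational or database search is required instead.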

Once the "new" sequence is aligned, as shown in 
Figure 2, the model building can begin. For the 
SCRs, usually the main chain coordinates from any 
one of the known structures can be taken (Fig. 1c). 
The side chains are mutated to those of the "new" 
sequence wherever necessary. In our implementation,
the χ1 angle for the "new" side chain is chosen
to maximize overlap of this side chain on the old one.
Further side chain χ angles are not fitted at this



†An exception would occur if there are too few residues
between this SCR and the neighboring SCR. In that case, the end
residues of an SCR would be left with a deletion. For examples,
see position 96 in PLM and ALP in Figure 4.
...to the "new" sequence where necessary. d: The various VR
conformations found in the known structures are considered for each
VR of the "new" protein. (Compare the respective VRs shown
here with those in a and b.) The ones that do not fit are rejected,
as shown by the crossed arrows. The most suitable is selected in
each case. In some cases, other conformational search 41-45 or
energetics 46,47 methods must be employed, since no suitable
conformation can be found for that VR among the known structures
(see VR in upper left corner with the sequence C-F-N-L-Q for an
example in which a different, new conformation is necessary). e:
The composite structure showing the source of the respective
"spare parts" selected for the model structure.



Fig. 2. Sequence alignment for the set of proteins in the
schematic homologous family corresponding to Figure 1. The
alignment is performed based solely on the overlap of the
three-dimensional structures and not by sequence alignment methods.
The boxes delineate the SCRs as determined from the
three-dimensional structure overlap. The sequence of a "new"
protein is aligned based on the characteristic patterns of sequence
homology found for the known structures and their sequences (see
text). The single letter amino acid codes are given in Figure 4.
stage, because the nature of the different side chains
usually diverges beyond this point. An automatic 
angle-scanning energy-optimizing routine is usually 
not employed, because we find that it often selects 
an unsuitable conformation based on minor or insig- 
nificant differences in energy. However, incorpora- 
tion of recently compiled and reported rotamer 
libraries 40 may improve our ability to automate the 
side chain conformation selection process. 

For those VRs for which an appropriate example 
structure can be selected from among the known 
structures during the alignment process described 
above, the main chain fragment is taken directly 
from the VR of that known structure, and again the 
side chains are mutated to fit the "new" sequence 
(see examples in Fig. 1d and e). Remaining VRs,
such as the VR at positions 3-6 in Figures 1 and 2,
have no good example to build upon among the 
known structures. Therefore, they have to be con- 
structed by more complex conformation search 
methods 41-45 and energetic considerations. 46,47 We 
have also used the protein database as a source of 
tentative starting conformations for loops of this 
type. The two ends of the particular loop that lie in 
the adjacent SCRs are taken, and the Brookhaven 
Protein Structure Database 8 is searched for struc- 
tural fragments that closely match the conformation 
of these ends (specifically, the a-carbon positions) 
and have the appropriate number of residues be- 
tween the ends. Thus, for example, in the VR at 
positions 3-6 of the "new" sequence, the structures 
of the two ends corresponding to residues "A-D-A" 
(positions -1 to 2) and "L-T-V-..." (positions 7 to
9...) are selected and typically ten structures that
match these two ends best [lowest root mean square 
(rms) deviations] and that have five residues in be- 
tween are chosen and examined. Obviously, any 
such conformation that would collide with the re- 
mainder of the constructed "new" structure is elim- 
inated. Similarly, if the conformation does not pack 
well or buries a single charged group, it is rejected. 
In this way, one or a small number of initial confor- 
mations can be selected for this loop. Clearly this 
latter method does not produce an exhaustive list; 
however, it does provide loop conformations that are 
known to appear in other proteins. This method has 
previously been described by Kraulis and Jones 48 
for constructing protein structures from fragments 
using nuclear magnetic resonance (NMR) data or 
using crystallographically generated electron den- 
sity maps. 
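The database search for loop candidates can be sketched as a sliding-window scan over α-carbon traces. Note one substitution: instead of superposing the anchor residues and computing the rms deviation, as described in the text, this sketch compares the internal α-carbon distances of the anchors, which avoids a superposition step but cannot distinguish a fragment from its mirror image. The function names and the dictionary layout of the database are assumptions.

```python
import math
from itertools import combinations

def anchor_mismatch(anchors_a, anchors_b):
    """Rms difference between corresponding alpha-carbon distances."""
    index_pairs = list(combinations(range(len(anchors_a)), 2))
    total = sum((math.dist(anchors_a[i], anchors_a[j]) -
                 math.dist(anchors_b[i], anchors_b[j])) ** 2
                for i, j in index_pairs)
    return math.sqrt(total / len(index_pairs))

def search_loops(query_n, query_c, database, loop_len, keep=10):
    """Rank database fragments whose end residues match the query anchors.

    query_n, query_c: lists of alpha-carbon positions for the anchor
    residues before and after the loop. database maps a chain name to its
    alpha-carbon trace (a list of points).
    Returns up to `keep` tuples (mismatch, chain_name, loop_start_index).
    """
    n, c = len(query_n), len(query_c)
    query = list(query_n) + list(query_c)
    hits = []
    for name, trace in database.items():
        for start in range(len(trace) - (n + loop_len + c) + 1):
            loop_start = start + n
            loop_end = loop_start + loop_len
            ends = list(trace[start:loop_start]) + list(trace[loop_end:loop_end + c])
            hits.append((anchor_mismatch(query, ends), name, loop_start))
    return sorted(hits)[:keep]
```

The ten best fragments returned by default mirror the number examined in the text; each would still have to be checked for collisions, packing, and buried charges before being accepted.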

It is important to emphasize that it is the initial 
spatial alignment of the three-dimensional structures
performed above (Fig. 1b) that allows this convenient
clipping of fragments or "spare parts" from
the various known structures to construct a composite
model structure of the "new" protein. Thus
overlapping of the structures is essential for two
crucial steps: for the original alignments of the
sequences



and for clipping together fragments in the subse- 
quent construction of each "new" model protein 
structure. Note that, the more experimentally 
known structures available, the more likely the 
boundaries of the SCRs will be well-defined and the 
greater the likelihood that a suitable example 
known structure, i.e., spare part, will be found for 
each "new" VR among the known structures. 

Because the different portions of the "new" con- 
structed model structure arise from quite different 
sources (Fig. 1e), we can assign appropriate
qualitative reliability confidence levels to the respective
parts of the structure. Clearly, the conformations in 
the SCRs are the most reliable, especially when sev- 
eral known structures are available and have been 
compared in detail. This is true only if the respective 
SCRs of the "new" sequence retain the characteristic 
homologous sequence patterns in those SCRs. When 
they do not, and this is illustrated below, it should 
be regarded as a red flag warning that something 
different may be happening in this region of the par- 
ticular "new" protein. In such cases, great care must 
be taken in constructing the "new" structure and the 
confidence level in this portion of the molecule re- 
duced appropriately. For the VRs, when a good 
model structure appears for the respective VR 
among the known structures, then a fair confidence 
level can be assigned. This level is not as high as in 
the SCRs because of the greater inherent variability 
of these loop regions. The actual confidence level for 
each such VR would depend on the size of the loop 
(larger loops have more degrees of freedom and are 
therefore less reliably determined), how good the se- 
quence homology is to the known structure selected 
to model this VR, and how well the chosen confor- 
mation fits and packs onto the rest of the "new" 
structure. The confidence is lowest for those VRs 
with no good model loop among the known struc- 
tures. When conformational methods or database 
search techniques have to be employed,
experience 16-18 shows that we are not able to predict
the conformation reliably in many of these cases. 
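The qualitative confidence levels described above can be summarized as a small decision rule. The labels, the eight-residue threshold for a "long" loop, and the function name are assumptions chosen for illustration; the text assigns confidence qualitatively and does not define numerical limits.

```python
def region_confidence(kind, pattern_retained=True, template_found=False,
                      loop_len=0, long_loop=8):
    """Return a qualitative confidence label for one region of the model.

    kind: 'SCR' or 'VR'.
    pattern_retained: the characteristic SCR sequence pattern is present.
    template_found: a suitable VR was found among the known structures.
    """
    if kind == "SCR":
        # A missing pattern is the red flag discussed in the text.
        return "high" if pattern_retained else "reduced"
    if not template_found:
        return "low"             # built by conformational or database search
    # Larger loops have more degrees of freedom and are less reliable.
    return "fair" if loop_len < long_loop else "fair to low"
```

Sequence homology to the chosen template and the quality of packing would refine the VR label further; they are omitted here to keep the rule short.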

The resultant structure is then examined in detail 
to identify possible serious errors. For example, the 
inner cores of the structure are checked to be sure 
that no inappropriate charges have been buried. The 
modeling method makes it unlikely that there are 
serious steric contacts between main chain atoms in 
the model. This is because the main chain coordi- 
nates in the SCRs were taken from the known struc- 
tures. Only if the bad contact was present in the 
parent known structures will it be found in the 
model. In the VRs, absence of main chain overlap 
was a major criterion for acceptance of a conforma- 
tion. Therefore, in practice, bad steric contacts can 
usually be relieved by side chain rotations and
occasionally by selecting an alternative conformation
for a VR. The model structure can then be introduced
into an energy-minimization program 46,47 to



Fig. 3. α-Carbon plots showing the superposition of the seven
experimentally known serine protease structures (see Table I).
Note the SCRs, where the seven structures are virtually the same,
and the VRs, where the structural divergence is considerable.
The key to the plots is as follows: CHT, green; TRP, solid red;
ELA, solid blue; KAL, solid cyan; MCP, dotted green; TON, dotted
red; SGT, dotted blue.



relieve any remaining minor steric contacts and to 
optimize the bond angles and torsional angles. Typ- 
ically, we begin with restrained or template-forced 49 
minimization, forcing the atomic coordinates to re- 
main close to their initial positions. This allows bad 
contacts to relax without introducing large artifac- 
tual distortions into the structure. After several 
hundred cycles, the constraints can be progressively 
removed and the energy of the molecule minimized. 
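The idea of restrained minimization followed by progressive release of the restraints can be shown on a toy energy function. The sketch below is not the force field of the programs cited; it assumes a soft repulsion between atoms closer than d_min plus a harmonic restraint tying each atom to its starting position, minimized by steepest descent. All parameter values and function names are illustrative.

```python
import math

def minimize(coords, restraint_k, reference, d_min=3.0, step=0.05, cycles=300):
    """Steepest descent on a toy energy.

    E = sum over pairs with d < d_min of (d_min - d)^2
        + restraint_k * sum over atoms of |x - x_reference|^2
    """
    coords = [list(p) for p in coords]
    for _ in range(cycles):
        # Gradient of the harmonic positional restraints.
        grad = [[2.0 * restraint_k * (x - r) for x, r in zip(p, ref)]
                for p, ref in zip(coords, reference)]
        # Gradient of the soft repulsion between close atom pairs.
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                d = math.dist(coords[i], coords[j])
                if 0.0 < d < d_min:
                    for a in range(3):
                        g = -2.0 * (d_min - d) * (coords[i][a] - coords[j][a]) / d
                        grad[i][a] += g
                        grad[j][a] -= g
        for p, g in zip(coords, grad):
            for a in range(3):
                p[a] -= step * g[a]
    return [tuple(p) for p in coords]

def relax(model, schedule=(1.0, 0.1, 0.0)):
    """Minimize with progressively weaker restraints to the starting model."""
    start = [tuple(p) for p in model]
    coords = start
    for k in schedule:
        coords = minimize(coords, k, start)
    return coords
```

With strong restraints the bad contact is only partly relieved and the atoms stay near their starting positions; as the restraint constant falls to zero the contact relaxes fully, which is the behavior the staged protocol is meant to achieve.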

Experience with comparative molecular
modeling 16-18 has shown that these models are never
completely accurate, especially in the conformations 
selected for the VRs. Therefore, any uses of the 
model structure and predictions that are made based 
on it should take into consideration this fact and the 
local reliability confidence levels discussed above. 
We always regard such a model as a working hy- 
pothesis; as new data emerge, whether from struc- 
tural, spectroscopic, or biochemical sources, the 
model is modified, corrected, improved, and refined 
to fit the new data. 

RESULTS AND DISCUSSION 
Initial Alignment of Known Serine 
Protease Structures 

Our analysis of the serine proteases began some 
years ago with the superposition of the then three 
known structures of chymotrypsin, trypsin, and 
elastase. 3,4 Since then, coordinates for four more



structures have appeared in the literature and data 
bank 8 : kallikrein, 27 mast cell protease, 28 Streptomy- 
ces griseus trypsin-like protease, 23 and tonin. 29 The 
overlap of these additional structures was performed 
by first superimposing the added experimental 
structure onto the rest using the α-carbon positions
of residues 195, 57, and 102.‡ The superposition was
refined, when necessary, using repeated cycles of
least squares fit on related residue α-carbon
positions between the protein to be fit and
chymotrypsin. Only residues whose positions were
conserved in the two structures were fitted in the
calculation; that is, any residue was eliminated from
the fitting if the deviation in α-carbon positions was
more than 2 Å on the first cycle and then more than
1.5 Å on subsequent cycles. Figure 3 shows the final
overlap obtained for these seven structures. Virtually
all the SCR positions identified in the previous
analysis of the first three proteins 4 remain
structurally conserved. The repertoire of
conformations, i.e., "spare parts," found for each of
the VRs is increased with the larger number of
structures, as expected (Table III).
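The refinement cycle described above can be sketched as follows. The least-squares step here uses Horn's unit-quaternion method with a power iteration for the dominant eigenvector, chosen only to keep the example free of external libraries; the fitting routine actually used in this work is not specified. The sketch assumes that mobile[i] and target[i] are already-equivalenced α-carbons, that the seed indices (for example the three active-site residues) are not collinear, and that all residues are re-tested on every cycle; these choices and the function names are assumptions.

```python
import math

def fit_transform(mobile, target, iterations=2000):
    """Least-squares rotation R and translation t with R*mobile + t ~ target."""
    n = len(mobile)
    cm = [sum(p[a] for p in mobile) / n for a in range(3)]
    ct = [sum(q[a] for q in target) / n for a in range(3)]
    S = [[sum((p[a] - cm[a]) * (q[b] - ct[b]) for p, q in zip(mobile, target))
          for b in range(3)] for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    N = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shifting by the largest absolute row sum makes every eigenvalue
    # non-negative, so power iteration converges to the largest one.
    shift = max(sum(abs(v) for v in row) for row in N) + 1e-12
    q = [1.0, 0.0, 0.0, 0.0]
    for _ in range(iterations):
        q = [sum(N[r][c] * q[c] for c in range(4)) + shift * q[r] for r in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    w, x, y, z = q
    R = [[w*w + x*x - y*y - z*z, 2*(x*y - w*z), 2*(x*z + w*y)],
         [2*(x*y + w*z), w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
         [2*(x*z - w*y), 2*(y*z + w*x), w*w - x*x - y*y + z*z]]
    t = [ct[a] - sum(R[a][b] * cm[b] for b in range(3)) for a in range(3)]
    return R, t

def apply(R, t, p):
    """Apply rotation R and translation t to the point p."""
    return tuple(sum(R[a][b] * p[b] for b in range(3)) + t[a] for a in range(3))

def iterative_fit(mobile, target, seed, first_cutoff=2.0, later_cutoff=1.5,
                  max_cycles=20):
    """Fit on the seed residues, then refit on conserved positions only.

    Residues deviating by more than first_cutoff (first cycle) or
    later_cutoff (subsequent cycles) are left out of the next fit.
    Returns (R, t, kept_indices).
    """
    kept = list(seed)
    R, t = fit_transform([mobile[i] for i in kept], [target[i] for i in kept])
    for cycle in range(max_cycles):
        cutoff = first_cutoff if cycle == 0 else later_cutoff
        new_kept = [i for i in range(len(mobile))
                    if math.dist(apply(R, t, mobile[i]), target[i]) <= cutoff]
        if new_kept == kept or len(new_kept) < 3:
            break
        kept = new_kept
        R, t = fit_transform([mobile[i] for i in kept], [target[i] for i in kept])
    return R, t, kept
```

The residues that survive the final cycle are natural candidates for the SCRs, while those rejected fall in the VRs. The power iteration converges slowly for nearly collinear point sets, so a library eigenvalue solver would be preferable in production use.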



‡The residue numbering given for the serine proteases follows
that of the chymotrypsinogen molecule throughout this
paper, in the text, in the tables, and in the figures. 
TABLE III. Sequence Lengths of the Different VRs in the Serine Proteases

                                              VR
         35-41    59-62    72-80   97-101  125-133  146-151  166-179  185-189  203-206  217-224
CHT          7        4        9        5        9        4       14        3        4        8
TRP          5        3        9        5        7        6       14        5        0        8
ELA         10        5        9        7        9        5       16        4        4       10
KAL          6        3        9        9        7        8       14        5        0        9
MCP         10        3        9        5        9        5       13        5        0        6
SGT          2        8        7        3        7        5       15        6        5        8
TON          4        3        9       16        7        6       14        5        0        9

HPH          6       16        0        1        8        4       31        5        6        7
PRZ          7        1        4        5       11        2       13        3        4        6
PRC          7        5        9        5       13       10       14        5        4        8
NGA          6        3        9       16        7        6       14        5        0        9
NGG          6        3        9       12        7        6       14        5        0        9
VII          6        8        9        5       12        5       19        5        4        8
FIX          6        5        9        7       11        5       14        5        4        8
FAX          7        5        8        5       11        5       14        5        4        8
FXI          9        8        9        5        9        5       15        5        3        8
XII          5        8        9        5        9        6       16        5        7        8
PLM          7        8        9       -1        9        4       16        5        4        8
ALP          7        8        9       -1        9        4        7        5        4        8
TPA         12        8        9        5        9        6       16       11        4        8
UKH         11        8        9        7        9        6       16        5        4        8
THR          8       13       10        6       12       11       14        8        6        8
CFB         10        8        1       14       25        3       27        7        4       14
CF2          6        8        5       14       25        0       28        4        4       21
CFD
Based on the above superposition of structures,
the seven sequences were aligned as described above
(Fig. 4). Note the wide variation in the length of the
different sequences in each of the VRs (Table III).
Using this sequence alignment, we attempted to
identify the critical conserved stretches of amino
acid sequence. These are characteristic of each SCR
and are essential for aligning "new" sequences to
the known ones, which is the first step in modeling.
The patterns that were chosen are noted in Figure 4
at the bottom of the sequence list in the row labeled
"CON." In some cases, there was almost complete
conservation of a single amino acid. In other cases,
only a hydrophobic side chain (typical of an internal
position) seems to be required.

Because the identification of these stretches is
such an important part of the modeling process, it is
worthwhile describing in more detail how this is
performed. In most cases, it is straightforward; the
conserved sequence is readily apparent. For example,
the characteristic I-V-G-G at positions 16-19
(see Fig. 4) defines the first SCR very clearly.
Similarly, the remarkably conserved S/T/A-G-W-G
sequence of residues 139-142 is highly characteristic
of this SCR. On the other hand, the SCRs at 63-71
and 81-96 have no such clearly conserved sequence,
and it is often very difficult to align a "new"
sequence to the known ones in these SCRs
unambiguously.

To understand better why this latter group of
SCRs has so little sequence homology, the structures
of these two regions were examined in detail (Fig. 5).
It is immediately apparent that these SCRs lie on
the surface of the molecule, and virtually all the
residues point out into solvent and thus can and do
vary with little consequent effect on the rest of the
structure. However, in each SCR, there are some
side chains that point into the interior of the
molecule and thus are more conserved. In the SCR at
63-71, this includes the Val (or comparable aliphatic
residue) at 66 and the aliphatic side chain at 68.
The Gly at position 69 appears to be almost
completely conserved, probably because it forms a turn
and adopts a conformation (φ = ~60°, ψ = ~25°)
that is permitted only for Gly (Fig. 5). Similarly,
small hydrophobic side chains occur at positions 83
and 85 in the SCR at residues 81-96 (Fig. 5). More
varied but always hydrophobic residues appear at
89, and an aromatic one is typical at 94, both of
which point into the molecule. In addition, there is
also a tendency for a His-Pro at positions 91-92
(Fig. 5); however, a number of sequences have only one
or none of these two residues (Fig. 4).

Fig. 5. The location is shown for some of the SCRs that are
difficult to align in the serine proteases because of only weak
characteristic patterns of homology. This is usually due to the
SCR lying on the surface of the molecule. This figure presents the
Cα plot of CHT (dotted lines), with the location shown of the
respective SCRs that are difficult to align: 63-71 and 81-96 (solid
lines). Note that these SCRs lie on the surface of the molecule and
that most of their side chains point out into solvent and thus are
free to vary more rapidly than internal residues. In the SCR at
63-71, conserved residues 66 and 68 are the only internal ones
and are always hydrophobic; 69 is always a Gly, because it forms
a turn. In the SCRs at 81-96, internal hydrophobic residues are
83, 85, 89, and 94. Partial conservation is observed at positions
91 and 92, with a His-Pro.



CNO 55 60 65 70 75 80 CNO

> < ><-> <- 

CHT AAHCGVT TSDVWAGEFDQGSSSE-KIQK CHT 

TRP AAHCYKS GIQVRLGQDNINV-VEGNQQF TRP 

ELA AAHCVDRE LT F RWVGE HtTLNQ — NNGTEQ Y ELA 

KAL AAHCK NDHYEVWLGRHNI.FE-NENTAQ.F KAL 

MCP AAHCRG REITVILGAHDVRK-AESTQQK MCP

SGT AAHCVSGS GNNTS ITATGGWDLQS GAAVK SGT

TON AAHCYSN HYOVLLGRNNT.F-K-PEPFAOR TON

HPHold TAKNLFLN="—HSENATAKDIA-PTLT1Y-— VGKKQL HPHold 

HPHnew TAKNLFLNHSENATAKDIAPTLTLYVGKK»-°— ~-»QL HPHnew 

> < ><-> <- 

CON AAHC V lG CON

Fig. 6. The old 3,4 and new sequence alignments for HPH,
relative to those of the known three-dimensional structures, for the
region around the SCR 63-71. Note that six residues have been
moved from the VR after position 71 to the VR prior to position 63.
Nomenclature and labeling are as in Figure 4.






Fig. 4. Sequence alignment for many of the known serine protease 
sequences. The source of these various sequences and the 
definitions of the protein name codes used in this Figure are given 
in Table II. The first seven proteins (above the line) are those with 
known experimental three-dimensional structures (see Table I). 
Their sequences have been aligned based on the superposition of 
the three-dimensional structures as described in the text. The 
remaining proteins were aligned using the characteristic sequence 
matching patterns (see text). When the sequence alignment 
is uncertain, the respective sequence is shown in italics. The 
[...] IUPAC-IUB convention standard single letter amino acid code. 




[...] past the last residue shown in the figure. 
Positions of relative deletions in the sequences are denoted by a 
dash in the known structures and by a double dash in the sequences 
aligned using the characteristic homology patterns. The 
line labeled "CNO" gives the chymotrypsinogen residue numbering 
used throughout this paper. The *+ symbols delineate the 
SCRs. The "CON" line lists the conserved, characteristic sequence 
patterns used to align the "new" sequences (see text). 
They are coded as follows: upper case, almost completely conserved 
side chain; lower case, high frequency of this amino acid 
at this position; X, nonpolar residues, typically A, V, L, I, M; π, 
denotes a polar residue, typically S, T, Q, N plus the charged 
residues; p, denotes S or T ([...] by A occasionally); 
[...] denotes aromatic, usually F or Y, sometimes W; +, denotes 
a positive residue such as R, K, or H; -, denotes a negative 
residue such as D or E. The line labeled "S-S" gives the position 
of the half-cystine [...] to the half-cystine 
at the label position. A question mark [...] positions 
where it is not clear to which residue a disulfide bridge is formed. 
Fig. 7. Cα plots for the old (dotted lines) and new (solid lines) 
[...] HPH based on the respective [...] 
63-71. Because the VR from 59 to 62 in the new HPH alignment 
is [...] longer than in the previous alignment (Fig. 6) and 
[...] residues longer than any of the currently known 
[...] in this loop (see Figs. 4 and 6, Table III), the conformation 
which has been drawn for this VR (dashed lines) is arbitrary. 
It has been included only to give a sense of the relatively large 
size of this loop. The difficulties associated with [...] 
conformation for such large additions are described in the text. 
The VR at residues 72-80 is highly truncated, seven residues 
shorter, in the new structure. Compare the conformations of the 
known structures for the VRs at 59-62 and 72-80 (Fig. 3) with 
[...] HPH. Whereas the Cα positions (and main chain 
coordinates) for residues 63-71 are the same in this figure for the 
[...] structures of HPH, the side chains at each residue 
position are, of course, different in the two alignments (Fig. 6), 
giving different overall structures for this part of the molecule as 



Alignment of "New" Serine 
Protease Sequences 

Identification of the characteristic sequence con- 
servation patterns for each SCR allows the process 
of aligning "new" sequences to begin. Figure 4 
shows the alignment of a large number of serine 
protease sequences to those of the known structures. 
The proteins included are taken from a very wide 
variety of functions and species. It is worth noting 
that three of these proteins, although clearly homol- 
ogous to the rest and therefore full members of the 
family, are no longer functional serine proteases. 
These are the heavy chain of haptoglobin (HPH), 
protein Z (PRZ), and the a-chain of nerve growth 
factor (NGA). For all these proteins, the character- 
istic pattern for each SCR was located and matched. 
Then the remaining positions in that SCR were 
filled, without permitting any additions or deletions. 
One of the results that emerges immediately from 
this alignment (Fig. 4) is the strong reinforcement of 
the characteristic patterns in all these "new" se- 
quences. The same basic patterns appear in virtu- 
ally every protein, yet, in almost every pattern, 
there are individual protein sequences that have 
exceptions. Usually the exceptions are minor deviations 
from the theme of the conservation. For example, 
the characteristic four-residue stretch from 139 to 142, 
S/T/A-G-W-G, is present in 26 of 35 protein sequences 
reported in Figure 4. The Trp is replaced by 
a Phe or a Tyr, both of which are typically acceptable 
replacements for a Trp, in four other proteins. 
In the same way, the characteristic sequence C-G-G 
almost always appears at positions 42-44, but some- 
times one of the Glys becomes an Ala, and occasion- 
ally the Cys is replaced by an alternative small side 
chain. It is clear that, with perhaps the occasional 
exception of the difficult SCRs between 63-71 and 
81-96 discussed above, all the sequences shown can 
be aligned relatively trivially and unambiguously to 
the respective characteristic conserved sequences in 
the SCRs (Fig. 4). 
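
The matching step described above can be pictured with a short sketch. This is only an illustration of the idea, not the procedure actually used to build Figure 4: the regular expressions, the function names `locate_pattern` and `place_scr`, and the toy sequence in the usage note are all assumptions introduced here. The sketch looks for the 139-142 pattern S/T/A-G-W-G, falls back to the Trp-to-Tyr/Phe variants mentioned in the text, and then fills the rest of the SCR at 134-145 without allowing additions or deletions.

```python
import re

# Assumed encodings of the 139-142 pattern S/T/A-G-W-G. The relaxed form
# admits the Trp -> Tyr/Phe replacements described in the text.
STRICT = re.compile(r"[STA]GWG")
RELAXED = re.compile(r"[STA]G[WYF]G")


def locate_pattern(sequence):
    """Return (offset, kind) for the first strict match, otherwise the
    first relaxed match, otherwise None. Offsets are 0-based."""
    for kind, pattern in (("strict", STRICT), ("relaxed", RELAXED)):
        match = pattern.search(sequence)
        if match:
            return match.start(), kind
    return None


def place_scr(sequence, pattern_offset, pattern_resnum=139,
              scr_first=134, scr_last=145):
    """Fill an SCR around a located pattern with no additions or deletions.

    Returns a dict mapping family residue numbers (CNO numbering) to the
    residues of the new sequence.
    """
    first = pattern_offset - (pattern_resnum - scr_first)
    last = first + (scr_last - scr_first)
    if first < 0 or last >= len(sequence):
        raise ValueError("SCR does not fit inside the sequence")
    return {scr_first + i: sequence[first + i]
            for i in range(scr_last - scr_first + 1)}
```

For a hypothetical fragment such as `"PPLCTVSGWGRTSEGGQ"`, `locate_pattern` finds the strict pattern at offset 6, and `place_scr` then assigns residue numbers 134 through 145 to the twelve residues around it. A sequence with no match at all (as described later for CFB and CF2) returns `None`, which is the signal that the alignment needs closer inspection.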

The close examination of the SCRs at 63-71 and 
81-96 described in the previous section (Fig. 5), to- 
gether with the strong reinforcement of the charac- 
teristic sequence patterns in these SCRs (Fig. 4), has 
led to the realignment of the HPH sequence in this 
region. This results in a large change in the three- 
dimensional model structure for this protein from 
that previously proposed. 3,4 The old alignment (Fig. 
6) was forced by the need to avoid placing the 
charged Asp 65 in position 66, where it would be 
buried in the hydrophobic core of the N-terminal 
β-barrel. Pro was placed at position 69 as a replacement 
for a Gly that would permit a turn. Realization 
that the aliphatic residues at positions 66 and 68 
and the almost complete conservation of a Gly at 
position 69 (Figs. 4 and 5) make up the characteristic 
sequence pattern for this SCR led to the realignment 
of the HPH sequence relative to those of the 
known structures as shown in Figure 6. As a result 
Fig. 8. Cα plot for CHT is shown indicating the sites of VRs 
with large additions (dashed lines) and with large deletions (dotted 
lines) in CFB. The unusual residues described in the text are 
represented by solid lines. These residues include 16-20, 43, 
139-142, and 154-156. As can be seen, the unusual residues 
are localized in one region of the molecule around the normal site 



of the N-terminus of the protease peptide chain, residue 16. Residues 
16 and 17 should not be in their usual positions, because 16 
is not N-terminal in CFB or in CF2. Consequently, they really 
should be placed in the approximate position expected from our 
knowledge of the zymogen structures 33-36 where the cleavage at 
16 has not yet occurred. 



of the new alignment, the HPH sequence has a very 
large addition in the VR at 59-62 and a very short 
loop in the VR at 72-80. Now, however, the requirements 
of aliphatic residues at 66 and 68 and Gly at 
position 69 are completely satisfied. The difference 
between the old and new structures for HPH is il- 
lustrated in Figure 7, which demonstrates clearly 
that changes in the alignment can result in large 
changes to the derived structures. 
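
The constraint that drove this realignment can be expressed as a simple filter over ungapped placements. The sketch below is an illustration under stated assumptions, not the authors' procedure: the aliphatic set A, V, L, I, M is borrowed from the nonpolar class in the Figure 4 legend, the function name `scr_63_71_placements` is hypothetical, and the fragment in the usage note is an illustrative chymotrypsin-like stretch rather than data read from Figure 6.

```python
# Assumed aliphatic/nonpolar set, following the residues listed for the
# nonpolar class in the Figure 4 legend.
ALIPHATIC = set("AVLIM")

SCR_FIRST, SCR_LAST = 63, 71
SCR_LENGTH = SCR_LAST - SCR_FIRST + 1  # nine residues


def scr_63_71_placements(sequence):
    """Return the 0-based offsets at which an ungapped nine-residue window
    has aliphatic residues at positions 66 and 68 and Gly at position 69."""
    hits = []
    for start in range(len(sequence) - SCR_LENGTH + 1):
        window = sequence[start:start + SCR_LENGTH]
        res66 = window[66 - SCR_FIRST]
        res68 = window[68 - SCR_FIRST]
        res69 = window[69 - SCR_FIRST]
        if res66 in ALIPHATIC and res68 in ALIPHATIC and res69 == "G":
            hits.append(start)
    return hits
```

Applied to the illustrative fragment `"AAHCGVTTSDVVVAGEFDQGSS"`, the filter returns a single offset, 8, so that residue 63 falls on the Ser of `SDVVVAGEF`. When a sequence yields several acceptable offsets, or none, the choice has to rest on the other considerations discussed in the text, such as the resulting VR lengths.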

From an evolutionary perspective, deviation from 
the typical pattern of sequence conservation occurs 
most often in those members of the family that are 
no longer serine proteases in function. For example, 
HPH and PRZ are no longer serine proteases in 
function. Both have lost the active site Ser 195 (it is 
an Ala), and His 57 is also replaced by a Lys in HPH. 
Nevertheless, both proteins retain most of the characteristic 
sequence homology patterns, yet they deviate 
more than the typical sequence. Examples of 
this are the immediate area around the Ser 195 or 
the SCR at 63-71. The Streptomyces griseus trypsin-like 
protein also has a weaker homology pattern 
than the rest, as has been previously noted. 4 Even 
[...] are the other bacterial serine 
[...]. These latter are not considered in this 
[...] considerably greater difficulty [...] 
correctly by these [...] 



Analysis of the Variable Regions 

Once the sequences that correspond to the SCRs 
are identified and aligned, attention can be focussed 
on the remaining residues forming the VRs. Based 
on the alignment in Figure 4, the sequence lengths 
deduced for the various VRs are summarized in Table 
III. It is clear from Table III that an amazing 
degree of variation occurs in the length of these 
loops and therefore in their conformations (Fig. 3). 
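
Once the SCR boundaries are fixed in the alignment, the VR lengths follow by counting the residues, excluding deletion marks, that fall between them. The following sketch shows one way this bookkeeping could be done. It assumes each aligned row is a string in which `-` marks a relative deletion and that each VR is given as a half-open column range. The names `vr_length` and `vr_length_table`, and the two toy rows in the usage note, are hypothetical and are not taken from Table III.

```python
def vr_length(aligned_row, col_start, col_end):
    """Count residues (non-dash characters) in columns col_start..col_end-1."""
    return sum(1 for ch in aligned_row[col_start:col_end] if ch != "-")


def vr_length_table(alignment, vr_columns):
    """Build {protein: {vr_name: length}} from aligned rows.

    alignment  : dict mapping protein code to its aligned row
    vr_columns : dict mapping VR name to a (col_start, col_end) column range
    """
    return {
        name: {vr: vr_length(row, start, end)
               for vr, (start, end) in vr_columns.items()}
        for name, row in alignment.items()
    }
```

With two toy rows, `{"P1": "AAHCGV-TTSDV", "P2": "AAHCYKSGIQVR"}`, and a single VR spanning columns 4 to 8, the table reports lengths of 3 and 4 residues respectively, which is the kind of length difference summarized in Table III.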

In previous studies, 4 we classified the modeling of 
the VRs into five different classes. 1) A known struc- 
ture has the same-length VR and the same residue 
character. In this case, a model structure is con- 
structed for the "new" VR by clipping in the respec- 
tive VR from the known structure. In such cases, the 
confidence level in this part of the structure can be 
reasonably good, especially if it is a short VR. One 
example occurs among the known structures of the 
serine proteases at residues 97-101, where the VRs 
of chymotrypsin (CHT), trypsin (TRP), and the mast 
cell protease (MCP) are all the same length and 
have the same conformation (see this loop in Fig. 3). 
2) Different lengths [...] among the 
known structures, but [...] structural 
motif, e.g., a β-bend connecting two antiparallel 
β-strands. The [...] applied structural 
motif [...] be 
illustrated [...] positions 35-41 
the DSN protein in the SCR from 63 to 71. The usual 
Gly residue that invariably appears at position 69 is 
absent. This leads to an ambiguity regarding how to 
align the sequence. However, at least one of the pos- 
sibilities, the one shown in Figure 4, is likely to fit 
into the usual serine protease structure at this po- 
sition with only minor deviations. 

On the other hand, there are examples of larger 
changes, including the complete absence of one or 
more characteristic sequence patterns. This can be 
illustrated by the SCR at 134-145. This SCR has the 
highly conserved, very characteristic S/T/A-G-W-G 
sequence at 139-142, as noted above. Virtually ev- 
ery serine protease has this sequence, with a rare 
change from Trp to Tyr or Phe and a very rare 
change of the second Gly to something else (see PRZ 
for an example of divergence as discussed in the pre- 
vious paragraphs). Note, however, the sequence for 
this SCR in CFB and CF2 (Fig. 4). In these cases, the 
characteristic residues are completely missing. 
Clearly something is happening to the structure 
here that is unusual. The absence of these residues 
creates a serious problem in properly aligning these 
two sequences in this SCR. There is no obvious pat- 
tern of residues that would correspond to the S/T/ 
A-G-W-G sequence at 139-142 among the residues 
that are in this region of the sequence. It is possible, 
in principle, to place the sequence A-L/H-F-V at 
these positions, thereby replacing the conserved Trp 
with a Phe. However, this leaves internal positions 
136 and especially 138 with charged residues such 
as Asp and Lys, where nonpolar residues are char- 
acteristic and are called for by the structure. The 
sequences have been tentatively aligned as shown in 
Figure 4 to satisfy the requirements that internal 
residues be nonpolar, or at least uncharged. The uncertain 
alignments are shown in italics in Figure 4. 
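
The trade-off described in this paragraph, matching the sequence pattern versus keeping charged residues out of internal positions, can be made explicit by counting buried charges for each candidate placement. The sketch below is only a schematic of that reasoning. The charged set D, E, K, R, the function names, and the residues in the two candidate placements of the usage note are assumptions made for illustration and are not read from Figure 4.

```python
# Assumed set of charged side chains. His is left out here, which is a
# simplification made for this sketch.
CHARGED = set("DEKR")


def buried_charges(placement, internal_positions):
    """Return the internal residue numbers that carry a charged residue.

    placement          : dict mapping residue number to one-letter code
    internal_positions : iterable of residue numbers that point into the core
    """
    return sorted(pos for pos in internal_positions
                  if placement.get(pos) in CHARGED)


def fewest_buried_charges(candidates, internal_positions):
    """Return the name of the candidate placement with the fewest buried
    charges. Ties go to the first candidate in insertion order."""
    return min(candidates,
               key=lambda name: len(buried_charges(candidates[name],
                                                   internal_positions)))
```

For two hypothetical candidates, one that puts A-L-F-V at 139-142 but leaves Asp at 136 and Lys at 138, and one that keeps 136 and 138 nonpolar, `buried_charges` flags positions 136 and 138 in the first and none in the second, so `fewest_buried_charges` prefers the second. This mirrors the tentative choice made in the text but does not replace the structural judgment behind it.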

If we look further at these two sequences, CFB 
and CF2, we find another anomalous site. The clas- 
sic, characteristic pattern for positions 42-44 is C- 
G-G. Examination of the sequences in Figure 4 
shows that substitution of one of the Gly residues by 
Ala occurs occasionally. The startling introduction 
of a much larger Met side chain at position 43 in 
CFB and an even larger and charged Arg in CF2 
indicates that once again something strange is happening 
in these proteins. Examination of the location 
of these changes in the serine protease three-dimensional 
structure (Fig. 8) shows that they are 
immediately adjacent to each other in the structure 
even though they are quite distant in the sequence. 

A further anomalous sequence occurs at positions 
16-19 of these two sequences. Activation of a serine 
protease from its zymogen protein involves the 
cleavage of the peptide chain prior to residue 16 to 
generate a free amino terminus at this point in the 
molecule. The earliest serine protease crystal 
structure solutions 24 showed that this amino terminus 
formed an internal salt bridge with the side 
chain of Asp 194. The homology observed in the residues 
at positions 16-19 is likely to be diagnostic of 
this activation step in all these enzymes. Thus it is 
interesting that several of the sequences in Figure 4 
show major deviations from the characteristic se- 
quence pattern for these four residues. All the serine 
proteases have Ile or Val (or Leu) at positions 16 and 
17 and usually Gly at 18 and 19. Exceptions to this 
pattern are found in PRZ and NGA, neither of which 
is a true protease and thus presumably no longer 
retains an activation step. The two proteins CFB 
and CF2 are also striking in the absence of any ho- 
mology to the characteristic pattern for this stretch 
of residues or even between the two themselves. It is 
known that, unlike the other serine proteases, these 
two proteins are clipped some 200 residues N-terminal 
to residue 16. 53,54 Thus the activation mechanism 
for these proteins must be different from that 
of the other serine proteases. Figure 8 shows that 
the normal location of the N-terminal residues 16 
and 17 is also close to the two unusual sequences 
described above for the three-dimensional structure. 
Thus significant violations of three spatially ad- 
jacent characteristic sequence patterns appear in 
these two proteins, coupled with functional signifi- 
cance expressed in the absence of the normal mech- 
anism of zymogen activation. It is interesting to note 
that CFB and CF2 occupy parallel functional roles 
in the alternative and classical pathways of the com- 
plement system, respectively. 55 They form the pro- 
teolytic subunits of the respective C3 convertases in 
the two pathways and, upon binding of C3b sub- 
units, change their specificity to become C5 conver- 
tases. Taken together, these observations suggest 
that some unusual three-dimensional structure is 
occurring in this particular region of the molecule 
involving the activation site and the immediately 
adjacent active and specificity sites. What is difficult 
to predict or determine is just how different these 
structures actually are from the typical serine pro- 
tease structural theme in this region. The excellent 
preservation of the characteristic sequence homol- 
ogy patterns everywhere else in the CFB and CF2 
sequences (Fig. 4) suggests strongly that the re- 
maining parts of the molecule are close to their nor- 
mal conformation. However, there is no such reas- 
surance for this part of the molecule. 

As one begins to construct this part of the CFB 
molecule onto the typical SCR framework of the 
serine proteases, several problems are encountered 
very rapidly. The introduction of side chains at positions 
140 and 43 in place of Gly, in such close proximity, 
causes a steric collision of these two groups. 
In constructing the N-terminal portion of the chain, 
the conformation for residues 16-19 must be taken 
from the zymogen structure of chymotrypsinogen, 
since these residues cannot reside in the 
N-terminal Ile 16 binding pocket. This leaves 
the problem of what to do with Asp 194, the other 
half of the normal salt bridge to the amino terminus 
at Ile 16. Both possible conformations of the aspartate, 
that of the zymogen and of the mature enzyme, 
must be considered based on zymogen crystal 
studies. 33-36 Unfortunately, neither conformation 
seems possible without a significant change in the 
local CFB structure. In the zymogen conformation, 
the side chains of Met 43 and Ala 140 (both usually 
glycines) prevent the Asp 194 side chain from occu- 
pying its usual place. In the mature enzyme form, it 
is the side chain of Phe 142 (also typically a glycine) 
that is too close to permit the usual conformer (see 
Fig. 8). 

The resulting changes that need to be made to 
accommodate these unusual side chains are difficult 
to predict. Does the structure undergo a large number 
of small changes that result in the accommodation 
of the above groups in close to the normal conformation, 
or is there a significant change in this 
portion of the molecule that gives a very different 
conformation around Asp 194 and residues 139- 
142? It is not possible to answer this question using 
the energetics methods currently available. Such 
problems may have to wait either for experimental 
structure determinations or for improved under- 
standing of protein structure and energetics. There- 
fore, it is particularly important that comparative 
modeling methods, as described in this paper, allow 
us to recognize such regions of the molecule that are 
likely to be particularly difficult to construct reli- 
ably. 